feat: separate out write path executor with unbounded memory limit #26455
Conversation
Currently, when there is an OOM while snapshotting, the process keeps going without crashing. This behaviour is observed in main (commit: be25c6f). This means the WAL files keep increasing to the point that restarts can never replay all of them.

This is happening because of how memory is distributed: in enterprise especially, there is no need for an ingester to be allocated just 20% of memory for the DataFusion memory pool (which runs the snapshot), since the parquet cache is not in use at all. That 20% is too conservative for an ingester, so instead of redistributing the memory settings based on the mode the server is running in, this commit introduces a separate write path executor with no bound on memory (it still uses `GreedyMemoryPool` under the hood, with `usize::MAX` as the upper limit). This means that when the write path executor does hit OOM, it is a process-level OOM that stops the whole process. It is also important to let the snapshotting process use as much memory as it needs, because without that the buffer keeps getting bigger and runs into OOM anyway.

closes: #26422
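For illustration only, a minimal sketch of the "unbounded" pool idea described above, assuming DataFusion's `GreedyMemoryPool` API (this is not the PR's exact executor wiring):

```rust
// Sketch: a DataFusion memory pool with an effectively unlimited budget, as
// described for the dedicated write path executor. With `usize::MAX` the pool
// itself never refuses an allocation; the process only dies on a real OOM.
use std::sync::Arc;

use datafusion::execution::memory_pool::{GreedyMemoryPool, MemoryPool};

fn write_path_memory_pool() -> Arc<dyn MemoryPool> {
    Arc::new(GreedyMemoryPool::new(usize::MAX))
}
```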
It looks like the results of splitting the executors are good. But, I am hesitant to move forward on this in light of the comment on the metrics registry.
What was the reason that we can't have them share a registry? Maybe we can leave this PR as is, and figure out how to recombine them in follow-on work.
My understanding is that, in the current state, the "query"-like metrics produced by the write path's executor would not be reported by the prometheus `/metrics` HTTP API.
That may not be that big of a deal, especially if we are reporting a tailored set of metrics for Core/Enterprise write path from our own code (e.g., we report ingest rate).
// When you have extra executor, you need separate metrics registry! It is not clear what
// the impact would be
// TODO: confirm this is not going to mess up downstream metrics consumers
let write_path_metrics = setup_metric_registry();
This will likely be an issue, since the HTTP endpoint that serves prometheus (`/metrics`) assumes a single registry; see influxdb/influxdb3_server/src/http.rs, lines 734 to 740 in be25c6f:
fn handle_metrics(&self) -> Result<Response<Body>> {
    let mut body: Vec<u8> = Default::default();
    let mut reporter = metric_exporters::PrometheusTextEncoder::new(&mut body);
    self.common_state.metrics.report(&mut reporter);
    Ok(Response::new(Body::from(body)))
}
Is the issue that using the same registry for multiple executors causes them to overwrite each other, or contend for locks with each other?
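(As a side note: if the second registry is kept, one hedged sketch for not losing its metrics would be to report both registries into the same encoder. This is not something the PR implements, and `self.write_path_metrics` below is an assumed extra field holding the write path executor's registry.)

```rust
// Hypothetical variant of the handler quoted above.
fn handle_metrics(&self) -> Result<Response<Body>> {
    let mut body: Vec<u8> = Default::default();
    let mut reporter = metric_exporters::PrometheusTextEncoder::new(&mut body);
    // Report the shared registry first, then the write path registry, into the
    // same encoder so both appear in one /metrics response.
    self.common_state.metrics.report(&mut reporter);
    self.write_path_metrics.report(&mut reporter);
    Ok(Response::new(Body::from(body)))
}
```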
It runs into a panic:
2025-06-02T14:49:39.205230Z ERROR panic_logging: Thread panic panic_type="unknown" panic_message="More than one execution pool created: previously existing instrument" panic_file="/home/praveen/.cargo/git/checkouts/influxdb3_core-2ede6fca005e1dcf/fd0e474/iox_query/src/exec.rs" panic_line=281 panic_column=9
thread 'main' panicked at /home/praveen/.cargo/git/checkouts/influxdb3_core-2ede6fca005e1dcf/fd0e474/iox_query/src/exec.rs:281:9:
More than one execution pool created: previously existing instrument
stack backtrace:
0: 0x629f44b5e172 - <std::sys::backtrace::BacktraceLock::print::DisplayBacktrace as core::fmt::Display>::fmt::hc04c8f544ab24d66
1: 0x629f44b8eb63 - core::fmt::write::hfe57b7174b7d8eab
2: 0x629f44b595a3 - std::io::Write::write_fmt::h154385efa8565236
3: 0x629f44b5dfc2 - std::sys::backtrace::BacktraceLock::print::h0c8f24e22f5873a8
4: 0x629f44b5f24c - std::panicking::default_hook::{{closure}}::hd07d57e6a602c8e4
5: 0x629f44b5f04f - std::panicking::default_hook::h63d12f7d95bd91ed
6: 0x629f3fd807db - panic_logging::SendPanicsToTracing::new_inner::{{closure}}::h4f1478e3035af477
7: 0x629f44b5fd43 - std::panicking::rust_panic_with_hook::h33b18b24045abff4
8: 0x629f44b5f9c6 - std::panicking::begin_panic_handler::{{closure}}::hf8313cc2fd0126bc
9: 0x629f44b5e679 - std::sys::backtrace::__rust_end_short_backtrace::h57fe07c8aea5c98a
10: 0x629f44b5f68d - __rustc[95feac21a9532783]::rust_begin_unwind
11: 0x629f44b8bac0 - core::panicking::panic_fmt::hd54fb667be51beea
12: 0x629f414b6cb7 - iox_query::exec::Executor::new_with_config_and_executor::h3ef1059edcb25ade
13: 0x629f3f99fd9c - influxdb3::commands::serve::command::{{closure}}::h2cdf5ca9df83df25
14: 0x629f3f9b6fcb - influxdb3::main::{{closure}}::hc953cfc298ca6770
15: 0x629f3f987b39 - tokio::runtime::park::CachedParkThread::block_on::h51b18ac33f8a0e4d
16: 0x629f3fb2a7bf - tokio::runtime::runtime::Runtime::block_on::h9eb33b87acb6fa53
17: 0x629f3fc20d55 - influxdb3::main::h75a268e75e689bc6
18: 0x629f3fcbb256 - std::sys::backtrace::__rust_begin_short_backtrace::h5b4e77177edb3cca
19: 0x629f3fab4321 - std::rt::lang_start::{{closure}}::hc69eb1d94c6de306
20: 0x629f44b4e080 - std::rt::lang_start_internal::h418648f91f5be3a1
21: 0x629f3fc3b19d - main
22: 0x7ed71c33d488 - <unknown>
23: 0x7ed71c33d54c - __libc_start_main
24: 0x629f3f95b325 - _start
25: 0x0 - <unknown>
2025-06-02T14:49:39.295724Z WARN executor: DedicatedExecutor dropped without calling shutdown()
2025-06-02T14:49:39.296308Z WARN executor: DedicatedExecutor dropped without calling shutdown()
I can look into the panic and see if I can address it in a different way if this is going to mess up downstream consumers.
I looked into the code that runs into the panic.
It looks like there is an assumption that you violate the single memory pool -> executor relationship if the "datafusion_pool" instrument is already registered. We don't actually break that relationship: in this branch there are two executors and each has its own memory pool, so the relationship is still correct, but because the registry is shared it runs into this error.
I need to spend a bit more time to see if I can create the executor without hooking it up to metrics to start with (or use a different instrument name, e.g. "datafusion_write_pool") and then experiment with how it reports.
// These are new additions, just skimming through the code it does not look like we can
// achieve the same effect as having a separate executor. It looks like it's for "all"
// queries, it'd be nice to have a filter to say when the query matches this pattern
// apply these limits. If that's possible maybe we could avoid creating a separate
// executor.
per_query_mem_pool_config: PerQueryMemoryPoolConfig::Disabled,
heap_memory_limit: None,
Yep, these came in from `iox` in the most recent sync. They aren't used anywhere in our code yet. I opened #26460 to address using them.
@praveen-influx and I discussed this.
The sentiment was that enabling the use of a single `Registry` across multiple `Executor`s would require some upstream work in `iox`. The issue this change addresses is more important than the drawback of having an unused metric registry, so we can ship this and deal with the latter part in follow-on work.
We can open an issue in `iox` to summarize the problem and propose a change to support an optional or namespaced metric registry on executor creation.
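Purely as an illustration of what that upstream change could look like (all names here are hypothetical, not an existing `iox` API):

```rust
use std::sync::Arc;

// Hypothetical knobs for executor creation: let the caller pick the registry
// and the instrument name, so two executors can share one `metric::Registry`
// without tripping the "previously existing instrument" panic.
struct ExecutorMetricsConfig {
    registry: Arc<metric::Registry>,
    // e.g. "datafusion_pool" for the query executor,
    // "datafusion_write_pool" for the write path executor
    instrument_name: &'static str,
}
```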
Thanks @hiltontj - I'll merge this in and create a separate issue to track how to expose the write path memory pool metrics.
Created this #26484 issue to address the follow-up work.
Test
`TestServer` does not yield reproducible runs. Perhaps a separate suite that runs longer should be considered.

`./target/quick-release/influxdb3_load_generator write-fixed --tput 5.42 -w 40`
(note: this `write-fixed` is a sub-command in the load generator in enterprise - it is sitting in this PR at the moment; once it's merged I'll port it back to core)

Main branch
- 1G max memory, 500M for forcing snapshot, 200M for DF executor

Branch with separate write path executor
- 1G max memory, 500M for forcing snapshot, 200M for DF executor
- but also, the write path executor has unbounded memory (which means it'll only die if the whole process runs out of memory)

Checked `systemd-cgtop system.slice/run-p3076443-i3076743.scope` and the logs to see that it actually released the memory after each snapshot.